---
title: Apache Airflow
description: How to use the DataRobot Provider for Apache Airflow to implement a basic DAG orchestrating an end-to-end DataRobot AI pipeline.

---


# DataRobot provider for Apache Airflow

The combined capabilities of [DataRobot MLOps](mlops/index) and [Apache Airflow](https://airflow.apache.org/docs/){ target=_blank } provide a reliable solution for retraining and redeploying your models. For example, you can retrain and redeploy your models on a schedule, when model performance degrades, or with a sensor that triggers the pipeline when new data arrives. This quickstart guide on the DataRobot provider for Apache Airflow illustrates the setup and configuration process by implementing a basic [Apache Airflow DAG (Directed Acyclic Graph)](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html){ target=_blank } to orchestrate an end-to-end DataRobot AI pipeline. This pipeline includes creating a project, training models, deploying a model, scoring predictions, and returning target and feature drift data. In addition, this guide shows you how to import [example DAG files](https://github.com/datarobot/airflow-provider-datarobot/tree/main/datarobot_provider/example_dags){ target=_blank } from the `airflow-provider-datarobot` repository so that you can quickly implement a variety of DataRobot pipelines.

The DataRobot provider for Apache Airflow is a Python package built from [source code available in a public GitHub repository](https://github.com/datarobot/airflow-provider-datarobot){ target=_blank } and [published on PyPI (the Python Package Index)](https://pypi.org/project/airflow-provider-datarobot/){ target=_blank }. It is also [listed in the Astronomer Registry](https://registry.astronomer.io/providers/datarobot/versions/latest){ target=_blank }. For more information on using and developing provider packages, see the [Apache Airflow documentation](https://airflow.apache.org/docs/apache-airflow-providers/index.html){ target=_blank }. The integration uses the [DataRobot Python API Client](https://pypi.org/project/datarobot/){ target=_blank }, which communicates with DataRobot instances via the REST API. For more information, see the [DataRobot Python package documentation](https://datarobot-public-api-client.readthedocs-hosted.com/en/latest-release/){ target=_blank }.

## Install the prerequisites {: #install-the-prerequisites }

The DataRobot provider for Apache Airflow requires an environment with the following dependencies installed:

* [Apache Airflow](https://pypi.org/project/apache-airflow/){ target=_blank } >= 2.3

* [DataRobot Python API Client](https://pypi.org/project/datarobot/){ target=_blank } >= 3.2.0b1

To install the DataRobot provider, you can run the following command:

``` sh
pip install airflow-provider-datarobot
```

Before you start the tutorial, install the [Astronomer command line interface (CLI) tool](https://github.com/astronomer/astro-cli#readme){ target=_blank } to manage your local Airflow instance:

=== "MacOS"

    First, install Docker Desktop for [MacOS](https://docs.docker.com/desktop/install/mac-install/){ target=_blank }.

    Then, run the following command:

    ``` sh
    brew install astro
    ```

=== "Linux"

    First, install Docker Desktop for [Linux](https://docs.docker.com/desktop/install/linux-install/){ target=_blank }.

    Then, run the following command:

    ``` sh
    curl -sSL https://install.astronomer.io | sudo bash
    ```

=== "Windows"

    First, install Docker Desktop for [Windows](https://docs.docker.com/desktop/install/windows-install/){ target=_blank }.

    Then, see the [Astro CLI README](https://github.com/astronomer/astro-cli#windows){ target=_blank }.


Next, install [pyenv](https://github.com/pyenv/pyenv#simple-python-version-management-pyenv){ target=_blank } or another Python version manager.


## Initialize a local Airflow project {: #initialize-a-local-airflow-project }

After you complete the installation prerequisites, you can create a new directory and initialize a local Airflow project there with the [Astro CLI](https://github.com/astronomer/astro-cli#get-started){ target=_blank }:

1. Create a new directory and navigate to it:

    ``` sh
    mkdir airflow-provider-datarobot && cd airflow-provider-datarobot
    ```

2. Run the following command within the new directory to initialize a new project with the required files:

    ``` sh
    astro dev init
    ```

3. Navigate to the `requirements.txt` file and add the following content:

    ``` txt
    airflow-provider-datarobot
    ```

4. Run the following command to start a local Airflow instance in a Docker container:

    ``` sh
    astro dev start
    ```

5. Once the installation is complete and the web server starts (after approximately one minute), you should be able to access Airflow at `http://localhost:8080/`.

6. Sign in to Airflow. The Airflow **DAGs** page appears.


    ![](images/airflow-dags-page.png)


## Load example DAGs into Airflow {: #load-example-dags-into-airflow }

The example DAGs _don't_ appear on the **DAGs** page by default. To make the DataRobot provider's example DAGs available in Airflow:

1. Download the DAG files from the [airflow-provider-datarobot](https://github.com/datarobot/airflow-provider-datarobot/tree/main/datarobot_provider/example_dags){ target=_blank } repository.

2. Copy the [`datarobot_pipeline_dag.py` Airflow DAG](https://github.com/datarobot/airflow-provider-datarobot/blob/main/datarobot_provider/example_dags/datarobot_pipeline_dag.py){ target=_blank } (or the entire `datarobot_provider/example_dags` directory) to the `dags/` directory of your Airflow project.

3. Wait a minute or two and refresh the page.

    The example DAGs appear on the **DAGs** page, including the **datarobot_pipeline** DAG:

    ![](images/airflow-example-dags.png)


## Create a connection from Airflow to DataRobot {: #create-a-connection-from-airflow-to-datarobot }
 
The next step is to create a connection from Airflow to DataRobot:

1. Click **Admin > Connections** to [add an Airflow connection](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#creating-a-connection-with-the-ui){ target=_blank }.

2. On the **List Connection** page, click **+ Add a new record**.

3. In the **Add Connection** dialog box, configure the following fields:

    ![](images/airflow-add-connection.png)

    Field           | Description
    ----------------|-------------
    Connection Id   | `datarobot_default` (this name is used by default in all operators)
    Connection Type | DataRobot
    API Key         | A DataRobot API token ([locate or create an API key in **Developer Tools**](api-key-mgmt#api-key-management))
    DataRobot endpoint URL | `https://app.datarobot.com/api/v2` by default

4. Click **Test** to establish a test connection between Airflow and DataRobot.

5. When the connection test is successful, click **Save**.
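
The connection stores the same endpoint and API token that the DataRobot Python API Client uses under the hood. If you want to sanity-check the credentials outside of Airflow, the following is a minimal sketch using the `datarobot` package; the token shown is a placeholder for your own value:

``` python
# Minimal sketch: verify DataRobot credentials with the Python API client.
# The token below is a placeholder for your own value.
import datarobot as dr

dr.Client(
    token="your-datarobot-api-token",
    endpoint="https://app.datarobot.com/api/v2",
)

# Listing projects confirms that the token and endpoint are valid.
print(dr.Project.list())
```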


## Configure the DataRobot pipeline DAG {: #configure-the-datarobot-pipeline-dag }

The [datarobot_pipeline Airflow DAG](https://github.com/datarobot/airflow-provider-datarobot/blob/main/datarobot_provider/example_dags/datarobot_pipeline_dag.py){ target=_blank } contains operators and sensors that automate the DataRobot pipeline steps. Each operator initiates a specific job, and each sensor waits for a predetermined action to complete:

Operator / Sensor              | Job
-------------------------------|-----------------------------------------------
CreateProjectOperator          | Creates a DataRobot project and returns its ID
TrainModelsOperator            | Triggers DataRobot Autopilot to train models
DeployModelOperator            | Deploys a specified model and returns the deployment ID
DeployRecommendedModelOperator | Deploys a recommended model and returns the deployment ID
ScorePredictionsOperator       | Scores predictions against the deployment and returns a batch prediction job ID
AutopilotCompleteSensor        | Senses if Autopilot completed
ScoringCompleteSensor          | Senses if batch scoring completed
GetTargetDriftOperator         | Returns the target drift from a deployment
GetFeatureDriftOperator        | Returns the feature drift from a deployment

!!! note
    This example pipeline doesn't use every available operator or sensor; for more information, see the [Operators](https://github.com/datarobot/airflow-provider-datarobot/tree/main#operators){ target=_blank } and [Sensors](https://github.com/datarobot/airflow-provider-datarobot/tree/main#sensors){ target=_blank } documentation in the project `README`.
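
The example DAG chains these operators and sensors in sequence, passing the resulting IDs between tasks through XComs. The following is a simplified sketch of that wiring; the import paths and parameter names are assumptions based on the example DAG, so refer to `datarobot_pipeline_dag.py` for the exact, current implementation.

``` python
# Simplified sketch of the example pipeline wiring. The import paths and
# parameter names are assumptions based on datarobot_pipeline_dag.py and
# may differ between provider versions.
from datetime import datetime

from airflow import DAG
from datarobot_provider.operators.datarobot import (
    CreateProjectOperator,
    DeployRecommendedModelOperator,
    GetTargetDriftOperator,
    ScorePredictionsOperator,
    TrainModelsOperator,
)
from datarobot_provider.sensors.datarobot import (
    AutopilotCompleteSensor,
    ScoringCompleteSensor,
)

with DAG(
    dag_id="datarobot_pipeline_sketch",
    schedule_interval=None,  # triggered manually with a configuration JSON
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    # Each task reads its settings from the DAG run configuration JSON.
    create_project = CreateProjectOperator(task_id="create_project")
    project_id = create_project.output  # project ID passed via XCom

    train_models = TrainModelsOperator(task_id="train_models", project_id=project_id)
    autopilot_complete = AutopilotCompleteSensor(
        task_id="autopilot_complete", project_id=project_id
    )

    deploy_model = DeployRecommendedModelOperator(
        task_id="deploy_recommended_model", project_id=project_id
    )
    deployment_id = deploy_model.output  # deployment ID passed via XCom

    score_predictions = ScorePredictionsOperator(
        task_id="score_predictions", deployment_id=deployment_id
    )
    scoring_complete = ScoringCompleteSensor(
        task_id="scoring_complete", job_id=score_predictions.output
    )

    target_drift = GetTargetDriftOperator(
        task_id="get_target_drift", deployment_id=deployment_id
    )

    (
        create_project
        >> train_models
        >> autopilot_complete
        >> deploy_model
        >> score_predictions
        >> scoring_complete
        >> target_drift
    )
```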

Each operator in the DataRobot pipeline requires specific parameters. You define these parameters in a configuration JSON file and provide the JSON when running the DAG; for example:

``` json
{
    "training_data": "local-path-to-training-data-or-s3-presigned-url-",
    "project_name": "Project created from Airflow",
    "autopilot_settings": {
        "target": "readmitted",
        "mode": "quick",
        "max_wait": 3600
    },
    "deployment_label": "Deployment created from Airflow",
    "score_settings": {}
}
```

The parameters from `autopilot_settings` are passed directly into the [`Project.set_target()`](https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.28.0/autodoc/api_reference.html#datarobot.models.Project.set_target){ target=_blank } method; you can set any parameter available in this method through the configuration JSON file.

The values of `training_data` and `score_settings` depend on the intake/output type. The parameters from `score_settings` are passed directly into the [`BatchPredictionJob.score()`](https://datarobot-public-api-client.readthedocs-hosted.com/en/v2.28.0/autodoc/api_reference.html#datarobot.models.BatchPredictionJob.score){ target=_blank } method; you can set any parameter available in this method through the configuration JSON file.
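
For reference, these settings map onto the following DataRobot Python API Client calls, which the operators make on your behalf. This is a simplified sketch using example local-file values; in the DAG itself, the operators handle client configuration, waiting, and ID passing for you.

``` python
# Simplified sketch of the client calls that the configuration settings map to.
# The token, file paths, project name, and deployment ID are example values.
import datarobot as dr

dr.Client(
    token="your-datarobot-api-token",
    endpoint="https://app.datarobot.com/api/v2",
)

# CreateProjectOperator and TrainModelsOperator: autopilot_settings are
# passed through to Project.set_target().
project = dr.Project.create(
    "include/Diabetes10k.csv", project_name="Project created from Airflow"
)
project.set_target(target="readmitted", mode="quick", max_wait=3600)

# ScorePredictionsOperator: score_settings are passed through to
# BatchPredictionJob.score() against the deployment created by the DAG.
job = dr.BatchPredictionJob.score(
    deployment="deployment-id",
    intake_settings={"type": "localFile", "file": "include/Diabetes_scoring_data.csv"},
    output_settings={"type": "localFile", "path": "include/Diabetes_predictions.csv"},
)
```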

For example, see the local file intake/output and Amazon AWS S3 intake/output JSON configuration samples below:

=== "Local file example"

    **Define `training_data`**

    For local file intake, provide the local path to the training data as the `training_data` value:

    ``` json linenums="1" hl_lines="2"
    {
        "training_data": "include/Diabetes10k.csv",
        "project_name": "Project created from Airflow",
        "autopilot_settings": {
            "target": "readmitted",
            "mode": "quick",
            "max_wait": 3600
        },
        "deployment_label": "Deployment created from Airflow",
        "score_settings": {}
    }
    ```

    **Define `score_settings`**

    For the scoring `intake_settings` and `output_settings`, define the `type` and provide the local data locations (`file` for intake and `path` for output):

    ``` json linenums="1" hl_lines="11 12 13 15 16 17"
    {
        "training_data": "include/Diabetes10k.csv",
        "project_name": "Project created from Airflow",
        "autopilot_settings": {
            "target": "readmitted",
            "mode": "quick",
            "max_wait": 3600
        },
        "deployment_label": "Deployment created from Airflow",
        "score_settings": {
            "intake_settings": {
                "type": "localFile",
                "file": "include/Diabetes_scoring_data.csv"
            },
            "output_settings": {
                "type": "localFile",
                "path": "include/Diabetes_predictions.csv"
            }
        }
    }
    ```

    !!! note
        When using the Astro CLI tool to run Airflow, you can place local input files in the `include/` directory. This location is accessible to the Airflow application inside the Docker container.

=== "Amazon AWS S3 example"

    **Define `training_data`**

    For Amazon AWS S3 intake, you can generate a pre-signed URL for the training data file on S3:

    1. In the S3 bucket, click the CSV file.

    2. Click **Object Actions** at the top-right corner of the screen and click **Share with a pre-signed URL**.

    3. Set the expiration time interval and click **Create presigned URL**. The URL is saved to your clipboard.
    

    4. Paste the URL in the JSON configuration file as the `training_data` value:

    ``` json linenums="1" hl_lines="2"
    {
        "training_data": "s3-presigned-url",
        "project_name": "Project created from Airflow",
        "autopilot_settings": {
            "target": "readmitted",
            "mode": "quick",
            "max_wait": 3600
        },
        "deployment_label": "Deployment created from Airflow",
        "datarobot_aws_credentials": "connection-id",
        "score_settings": {}
    }
    ```

    **Define `datarobot_aws_credentials` and `score_settings`**

    For scoring data on Amazon AWS S3, you can add your DataRobot AWS credentials to Airflow:
    
    1. Click **Admin > Connections** to [add an Airflow connection](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html#creating-a-connection-with-the-ui){ target=_blank }. 
    
    2. On the **List Connection** page, click **+ Add a new record**. 
    
    3. In the **Connection Type** list, click **DataRobot AWS Credentials**.

        ![](images/airflow-dr-aws-creds.png)
    
    4. Define a **Connection Id** and enter your Amazon AWS S3 credentials.

    5. Click **Test** to establish a test connection between Airflow and Amazon AWS S3.

    6. When the connection test is successful, click **Save**.
    
        You return to the **List Connection** page, where you should copy the **Conn Id**.
    
    You can now add the **Connection Id** / **Conn Id** value (represented by `connection-id` in this example) to the `datarobot_aws_credentials` field when you [run the DAG](#run-the-datarobot-pipeline-dag). 
    
    For the scoring `intake_settings` and `output_settings`, define the `type` and provide the `url` for the AWS S3 intake and output data locations:

    ``` json linenums="1" hl_lines="12 13 14 16 17 18"
    {
        "training_data": "s3-presigned-url",
        "project_name": "Project created from Airflow",
        "autopilot_settings": {
            "target": "readmitted",
            "mode": "quick",
            "max_wait": 3600
        },
        "deployment_label": "Deployment created from Airflow",
        "datarobot_aws_credentials": "connection-id",
        "score_settings": {
            "intake_settings": {
                "type": "s3",
                "url": "s3://path/to/scoring-data/Diabetes10k.csv",
            },
            "output_settings": {
                "type": "s3",
                "url": "s3://path/to/results-dir/Diabetes10k_predictions.csv",
            }
        }
    }
    ```

    !!! note
        Because this pipeline creates a deployment, the output of the deployment creation step provides the `deployment_id` required for scoring.

## Run the DataRobot pipeline DAG {: #run-the-datarobot-pipeline-dag }

After completing the setup steps above, you can run a DataRobot provider DAG in Airflow using the configuration JSON you assembled:

1. On the Airflow **DAGs** page, locate the DAG pipeline you want to run.

    ![](images/airflow-example-dags.png)

2. Click the run icon for that DAG and click **Trigger DAG w/ config**.

    ![](images/airflow-dag-trigger.png)

3. On the **DAG conf parameters** page, enter the JSON configuration data required by the DAG; in this example, use the JSON you assembled in the previous step.

4. Select **Unpause DAG when triggered**, and then click **Trigger**. The DAG starts running:

    ![](images/airflow-datarobot-pipeline-dag.png)

!!! note
    When you run Airflow in a Docker container (for example, with the Astro CLI tool), the predictions file is created inside the container. To make the predictions available on the host machine, specify an output location in the `include/` directory.
